Group 11
Gabrielle Felicia Ariyanto - 2540134874
Natasha Hartanti Winata - 2502039176
Caroline Angelina Sunarya - 2501995093
Clarissa Octavia Tjandra - 2540120143
Agnes Calista - 2501980690
Source Dataset : ‘https://www.kaggle.com/code/burhanykiyakoglu/predicting-house-prices/notebook’
House Data is a collection of house sale prices in King County, Washington, United States, for sales recorded between May 2014 and May 2015. House prices will be predicted using the linear regression approach.
#Import File
#The dataset from Kaggle is uploaded to github for access via the link
HouseData <- read.csv("https://raw.githubusercontent.com/GabrielleFeliciaA/house_price_data/main/kc_house_data.csv")
head(HouseData)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
`id` : (num) house id
`date` : (chr) the date the house was sold
`price` : (num) house price
`bedrooms` : (int) the number of bedrooms in a house
`bathrooms` : (num) the number of bathrooms in a house
`sqft_living` : (int) interior living area of the house, in square feet
`sqft_lot` : (int) land (lot) area, in square feet
`floors` : (num) the number of floors in the house
`waterfront` : (int) does the house have a view of the water
`view` : (int) view rating
`condition` : (int) house condition
`grade` : (int) overall assessment of the house
`sqft_above` : (int) interior area of the house above ground level, in square feet
`sqft_basement` : (int) basement area
`yr_built` : (int) year the house was built
`yr_renovated` : (int) year the house was last renovated (0 if it was never renovated)
`zipcode` : (int) zip code
`lat` : (num) latitude coordinates
`long` : (num) longitude coordinates
`sqft_living15` : (int) average interior living area of the 15 nearest neighbouring houses, in square feet
`sqft_lot15` : (int) average lot area of the 15 nearest neighbouring houses, in square feet
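The `date` column is stored as text, so it has to be converted before it can be used as a date. The sketch below shows one possible way to do this with base R. The vector `raw_date` is a hypothetical sample that copies the format printed by head(HouseData); the names `raw_date` and `sale_date` are not part of the original analysis.

```r
# Hypothetical sample of raw `date` strings in the format shown by head(HouseData)
raw_date <- c("20141013T000000", "20141209T000000", "20150225T000000")

# Keep the first 8 characters (YYYYMMDD) and convert them to the Date class
sale_date <- as.Date(substr(raw_date, 1, 8), format = "%Y%m%d")

print(sale_date)
print(range(sale_date))  # earliest and latest sale date in the sample
```

On the full data set the same call could be applied to HouseData$date, assuming every value follows this format.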
#Packages
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(mapview)
## Warning: package 'mapview' was built under R version 4.1.3
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
dim(HouseData)
## [1] 21613 21
The result above indicates that the data set contains 21613 observations (rows) and 21 variables (columns).
str(HouseData)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
The data set uses three data types: num, chr, and int. One (1) variable is of type chr, six (6) variables are of type num, and fourteen (14) variables are of type int. The chr variable is 'date'. The num variables are 'id', 'price', 'bathrooms', 'floors', 'lat', and 'long'. All remaining variables are of type int.
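The type counts can also be tallied in code rather than read off the str() output by hand. The following is a minimal sketch on a hypothetical four-column data frame named `toy`; note that class() reports "numeric", "integer", and "character" where str() prints num, int, and chr.

```r
# Hypothetical miniature data frame with the same kinds of columns as HouseData
toy <- data.frame(
  id       = c(7129300520, 6414100192),
  date     = c("20141013T000000", "20141209T000000"),
  price    = c(221900, 538000),
  bedrooms = c(3L, 3L),
  stringsAsFactors = FALSE
)

# First class of every column, then a tally per class
type_of_column <- vapply(toy, function(col) class(col)[1], character(1))
type_counts <- table(type_of_column)
print(type_counts)

# Names of the columns stored as numeric (shown as "num" by str())
numeric_columns <- names(type_of_column)[type_of_column == "numeric"]
print(numeric_columns)
```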
summary(HouseData)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
The insights gained from the result above are:
- The maximum house price is 7700000 and the minimum house price is 75000.
- The oldest houses were built in 1900 and the newest houses were built in 2015.
- In the 'yr_renovated' variable, the minimum value is 0, the maximum value is 2015, and the 1st quartile, median, and 3rd quartile are all 0. Since a value of 0 marks a house that was never renovated, this shows that at least 75% of the houses were never renovated. The quartiles alone do not reveal in which years the remaining houses were renovated.
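Because zeros dominate the quartiles of a renovation-year column, it can help to separate "never renovated" from the actual years. The sketch below uses a hypothetical vector and assumes, as the data suggest, that 0 stands for "never renovated"; the names `renovated_year` and `share_never` are illustrative only.

```r
# Hypothetical renovation years: 0 is assumed to mean "never renovated"
yr_renovated <- c(rep(0, 8), 1991, 2015)

# With this many zeros the quartiles are all 0, as in the summary() output
print(summary(yr_renovated))

# Recode 0 as NA so that summaries describe renovated houses only
renovated_year <- ifelse(yr_renovated == 0, NA, yr_renovated)
share_never <- mean(is.na(renovated_year))  # fraction never renovated

print(share_never)
print(summary(renovated_year))
```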
describe(HouseData)
## HouseData
##
## 21 Variables 21613 Observations
## --------------------------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 21436 1 4.58e+09 3.296e+09 5.125e+08 1.036e+09
## .25 .50 .75 .90 .95
## 2.123e+09 3.905e+09 7.309e+09 8.732e+09 9.297e+09
##
## lowest : 1000102 1200019 1200021 2800031 3600057
## highest: 9842300095 9842300485 9842300540 9895000040 9900000190
## --------------------------------------------------------------------------------
## date
## n missing distinct
## 21613 0 372
##
## lowest : 20140502T000000 20140503T000000 20140504T000000 20140505T000000 20140506T000000
## highest: 20150513T000000 20150514T000000 20150515T000000 20150524T000000 20150527T000000
## --------------------------------------------------------------------------------
## price
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 4028 1 540088 329387 210000 245000
## .25 .50 .75 .90 .95
## 321950 450000 645000 887000 1156480
##
## lowest : 75000 78000 80000 81000 82000
## highest: 5350000 5570000 6885000 7062500 7700000
## --------------------------------------------------------------------------------
## bedrooms
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 13 0.871 3.371 0.946 2 2
## .25 .50 .75 .90 .95
## 3 3 4 4 5
##
## lowest : 0 1 2 3 4, highest: 8 9 10 11 33
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 13 199 2760 9824 6882 1601 272 38 13 6 3
## Proportion 0.001 0.009 0.128 0.455 0.318 0.074 0.013 0.002 0.001 0.000 0.000
##
## Value 11 33
## Frequency 1 1
## Proportion 0.000 0.000
## --------------------------------------------------------------------------------
## bathrooms
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 30 0.974 2.115 0.8444 1.00 1.00
## .25 .50 .75 .90 .95
## 1.75 2.25 2.50 3.00 3.50
##
## lowest : 0.00 0.50 0.75 1.00 1.25, highest: 6.50 6.75 7.50 7.75 8.00
## --------------------------------------------------------------------------------
## sqft_living
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 1038 1 2080 978.4 940 1090
## .25 .50 .75 .90 .95
## 1427 1910 2550 3250 3760
##
## lowest : 290 370 380 384 390, highest: 9640 9890 10040 12050 13540
## --------------------------------------------------------------------------------
## sqft_lot
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 9782 1 15107 17855 1800 3322
## .25 .50 .75 .90 .95
## 5040 7618 10688 21398 43339
##
## lowest : 520 572 600 609 635
## highest: 982998 1024068 1074218 1164794 1651359
## --------------------------------------------------------------------------------
## floors
## n missing distinct Info Mean Gmd
## 21613 0 6 0.823 1.494 0.5563
##
## lowest : 1.0 1.5 2.0 2.5 3.0, highest: 1.5 2.0 2.5 3.0 3.5
##
## Value 1.0 1.5 2.0 2.5 3.0 3.5
## Frequency 10680 1910 8241 161 613 8
## Proportion 0.494 0.088 0.381 0.007 0.028 0.000
## --------------------------------------------------------------------------------
## waterfront
## n missing distinct Info Sum Mean Gmd
## 21613 0 2 0.022 163 0.007542 0.01497
##
## --------------------------------------------------------------------------------
## view
## n missing distinct Info Mean Gmd
## 21613 0 5 0.267 0.2343 0.4322
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 19489 332 963 510 319
## Proportion 0.902 0.015 0.045 0.024 0.015
## --------------------------------------------------------------------------------
## condition
## n missing distinct Info Mean Gmd
## 21613 0 5 0.708 3.409 0.6161
##
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##
## Value 1 2 3 4 5
## Frequency 30 172 14031 5679 1701
## Proportion 0.001 0.008 0.649 0.263 0.079
## --------------------------------------------------------------------------------
## grade
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 12 0.903 7.657 1.231 6 6
## .25 .50 .75 .90 .95
## 7 7 8 9 10
##
## lowest : 1 3 4 5 6, highest: 9 10 11 12 13
##
## Value 1 3 4 5 6 7 8 9 10 11 12
## Frequency 1 3 29 242 2038 8981 6068 2615 1134 399 90
## Proportion 0.000 0.000 0.001 0.011 0.094 0.416 0.281 0.121 0.052 0.018 0.004
##
## Value 13
## Frequency 13
## Proportion 0.001
## --------------------------------------------------------------------------------
## sqft_above
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 946 1 1788 876.2 850 970
## .25 .50 .75 .90 .95
## 1190 1560 2210 2950 3400
##
## lowest : 290 370 380 384 390, highest: 7880 8020 8570 8860 9410
## --------------------------------------------------------------------------------
## sqft_basement
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 306 0.776 291.5 422.2 0 0
## .25 .50 .75 .90 .95
## 0 0 560 970 1190
##
## lowest : 0 10 20 40 50, highest: 3260 3480 3500 4130 4820
## --------------------------------------------------------------------------------
## yr_built
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 116 1 1971 33.38 1915 1926
## .25 .50 .75 .90 .95
## 1951 1975 1997 2007 2011
##
## lowest : 1900 1901 1902 1903 1904, highest: 2011 2012 2013 2014 2015
## --------------------------------------------------------------------------------
## yr_renovated
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 70 0.122 84.4 161.7 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## lowest : 0 1934 1940 1944 1945, highest: 2011 2012 2013 2014 2015
##
## Value 0 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980
## Frequency 20699 1 2 6 4 13 12 16 27 25 43
## Proportion 0.958 0.000 0.000 0.000 0.000 0.001 0.001 0.001 0.001 0.001 0.002
##
## Value 1985 1990 1995 2000 2005 2010 2015
## Frequency 88 99 84 112 156 82 144
## Proportion 0.004 0.005 0.004 0.005 0.007 0.004 0.007
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## zipcode
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 70 1 98078 60.77 98004 98008
## .25 .50 .75 .90 .95
## 98033 98065 98118 98155 98177
##
## lowest : 98001 98002 98003 98004 98005, highest: 98177 98178 98188 98198 98199
## --------------------------------------------------------------------------------
## lat
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 5034 1 47.56 0.1573 47.31 47.35
## .25 .50 .75 .90 .95
## 47.47 47.57 47.68 47.73 47.75
##
## lowest : 47.1559 47.1593 47.1622 47.1647 47.1764
## highest: 47.7771 47.7772 47.7774 47.7775 47.7776
## --------------------------------------------------------------------------------
## long
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 752 1 -122.2 0.1558 -122.4 -122.4
## .25 .50 .75 .90 .95
## -122.3 -122.2 -122.1 -122.0 -122.0
##
## lowest : -122.519 -122.515 -122.514 -122.512 -122.511
## highest: -121.325 -121.321 -121.319 -121.316 -121.315
## --------------------------------------------------------------------------------
## sqft_living15
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 777 1 1987 743.2 1140 1256
## .25 .50 .75 .90 .95
## 1490 1840 2360 2930 3300
##
## lowest : 399 460 620 670 690, highest: 5600 5610 5790 6110 6210
## --------------------------------------------------------------------------------
## sqft_lot15
## n missing distinct Info Mean Gmd .05 .10
## 21613 0 8689 1 12768 13404 1999 3667
## .25 .50 .75 .90 .95
## 5100 7620 10083 17852 37063
##
## lowest : 651 659 660 748 750, highest: 434728 438213 560617 858132 871200
## --------------------------------------------------------------------------------
The results above showed us:
- The smallest number of bedrooms in a house is 0, the largest is 33, and the most frequent is 3, which occurs in 9824 houses, or 45.5 percent of the data.
- Most houses have 1 or 2 floors: 10680 houses (49.4 percent) have 1 floor, and 8241 houses (38.1 percent) have 2 floors.
- Houses with a view rated 0 dominate the housing market, making up 90.2 percent of the data.
- Houses with a condition rated 3 dominate the housing market: 14031 houses have condition 3, which is 64.9 percent of the data.
- For the overall assessment (grade), most houses are graded 7 or 8. 8981 houses are graded 7 (41.6 percent of the data) and 6068 houses are graded 8 (28.1 percent of the data).
- The summary() output alone could not show when renovations took place. The frequency table above, in which renovation years are rounded to the nearest 5, shows renovations spread from the 1930s to 2015; the earliest recorded renovation year is 1934. 20699 houses, almost 96 percent of the data, were never renovated. Among the renovated houses, the most common rounded renovation years are 2005 (156 houses) and 2015 (144 houses), each about 0.7 percent of the data.
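The counts and shares quoted from describe() can be reproduced with base R. The following sketch uses a hypothetical `bedrooms` vector to show how a frequency table is turned into percentages and how the most frequent value is picked out.

```r
# Hypothetical bedroom counts for ten houses
bedrooms <- c(2, 3, 3, 3, 4, 4, 3, 2, 5, 3)

# Absolute frequencies and the matching percentages of all observations
counts <- table(bedrooms)
percent <- round(100 * prop.table(counts), 1)

# Most frequent value (returned as a character label of the table)
most_common <- names(counts)[which.max(counts)]

print(cbind(count = counts, percent = percent))
print(most_common)
```

The values in the `percent` column are shares of the data (percent), which is different from a percentile, a position within the sorted data.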
colSums(is.na(HouseData))
## id date price bedrooms bathrooms
## 0 0 0 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15
## 0
The output above shows that none of the variables contains missing values.
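any(duplicated(HouseData))
## [1] FALSE
The syntax above returned the value FALSE, which indicates that no row is an exact duplicate of another row. Note that 'id' has only 21436 distinct values among 21613 rows, so some houses appear more than once, presumably because they were sold more than once during the period; those rows differ in other columns and are therefore not exact duplicates. It can be concluded that the data set being used is free from missing values and fully duplicated rows.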
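A duplicate check depends on whether any() or all() wraps duplicated(). The sketch below, on a hypothetical data frame `toy`, illustrates the difference and also shows how repeated ids can be counted separately from fully duplicated rows.

```r
# Hypothetical sales: house 102 was sold twice at different prices
toy <- data.frame(
  id    = c(101, 102, 102, 103),
  price = c(200000, 350000, 365000, 410000)
)

any_full_dup <- any(duplicated(toy))    # is at least one row an exact copy?
n_repeat_id  <- sum(duplicated(toy$id)) # rows whose id already appeared earlier

# Append an exact copy of the first row to see how the two checks react
toy_with_copy <- rbind(toy, toy[1, ])
any_after <- any(duplicated(toy_with_copy))  # detects the copied row
all_after <- all(duplicated(toy_with_copy))  # TRUE only if EVERY row were a copy

print(c(any_full_dup = any_full_dup, any_after = any_after, all_after = all_after))
print(n_repeat_id)
```

In this sketch all() stays FALSE even after a duplicate row is added, so any() is the check that actually answers "are there duplicates?".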
# Three-sigma edit rule: limits at mean +/- t standard deviations
ThreeSigma <- function(x, t = 3){
  mu <- mean(x, na.rm = TRUE)
  sig <- sd(x, na.rm = TRUE)
  if (sig == 0){
    message("All non-missing x-values are identical")
  }
  up <- mu + t * sig
  down <- mu - t * sig
  out <- list(up = up, down = down)
  return(out)
}
# Hampel identifier: limits at median +/- t MAD scale estimates
Hampel <- function(x, t = 3){
  mu <- median(x, na.rm = TRUE)
  sig <- mad(x, na.rm = TRUE)
  if (sig == 0){
    message("Hampel identifier implosion: MAD scale estimate is zero")
  }
  up <- mu + t * sig
  down <- mu - t * sig
  out <- list(up = up, down = down)
  return(out)
}
# Boxplot rule: limits at the quartiles -/+ t interquartile distances
BoxplotRule <- function(x, t = 1.5){
  xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
  xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
  Q <- xU - xL
  if (Q == 0){
    message("Boxplot rule implosion: interquartile distance is zero")
  }
  up <- xU + t * Q
  down <- xL - t * Q
  out <- list(up = up, down = down)
  return(out)
}
# Classify each value as lower outlier ("L"), upper outlier ("U"), or nominal ("N")
ExtractDetails <- function(x, down, up){
  outClass <- rep("N", length(x))
  indexLo <- which(x < down)
  indexHi <- which(x > up)
  outClass[indexLo] <- "L"
  outClass[indexHi] <- "U"
  index <- union(indexLo, indexHi)
  values <- x[index]
  outClass <- outClass[index]
  nOut <- length(index)
  # Largest and smallest values that are still inside the limits
  maxNom <- max(x[which(x <= up)])
  minNom <- min(x[which(x >= down)])
  outList <- list(nOut = nOut, lowLim = down,
                  upLim = up, minNom = minNom,
                  maxNom = maxNom, index = index,
                  values = values,
                  outClass = outClass)
  return(outList)
}
# Apply all three rules to x and collect a summary plus per-rule outlier tables
FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
  threeLims <- ThreeSigma(x, t = t3)
  HampLims <- Hampel(x, t = tH)
  boxLims <- BoxplotRule(x, t = tb)
  n <- length(x)
  nMiss <- length(which(is.na(x)))
  threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
  HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
  boxList <- ExtractDetails(x, boxLims$down, boxLims$up)
  sumFrame <- data.frame(method = "ThreeSigma", n = n,
                         nMiss = nMiss, nOut = threeList$nOut,
                         lowLim = threeList$lowLim,
                         upLim = threeList$upLim,
                         minNom = threeList$minNom,
                         maxNom = threeList$maxNom)
  upFrame <- data.frame(method = "Hampel", n = n,
                        nMiss = nMiss, nOut = HampList$nOut,
                        lowLim = HampList$lowLim,
                        upLim = HampList$upLim,
                        minNom = HampList$minNom,
                        maxNom = HampList$maxNom)
  sumFrame <- rbind.data.frame(sumFrame, upFrame)
  upFrame <- data.frame(method = "BoxplotRule", n = n,
                        nMiss = nMiss, nOut = boxList$nOut,
                        lowLim = boxList$lowLim,
                        upLim = boxList$upLim,
                        minNom = boxList$minNom,
                        maxNom = boxList$maxNom)
  sumFrame <- rbind.data.frame(sumFrame, upFrame)
  threeFrame <- data.frame(index = threeList$index,
                           values = threeList$values,
                           type = threeList$outClass)
  HampFrame <- data.frame(index = HampList$index,
                          values = HampList$values,
                          type = HampList$outClass)
  boxFrame <- data.frame(index = boxList$index,
                         values = boxList$values,
                         type = boxList$outClass)
  outList <- list(summary = sumFrame, threeSigma = threeFrame,
                  Hampel = HampFrame, boxplotRule = boxFrame)
  return(outList)
}
summary_outliers <- FindOutliers(HouseData$price)
summary_outliers$summary
## method n nMiss nOut lowLim upLim minNom maxNom
## 1 ThreeSigma 21613 0 406 -561293.4 1641470 75000 1640000
## 2 Hampel 21613 0 1166 -217170.0 1117170 75000 1115500
## 3 BoxplotRule 21613 0 1146 -162625.0 1129575 75000 1127500
avg <- mean(HouseData$price)
std <- sd(HouseData$price)
# Count the prices that lie more than 3 standard deviations from the mean
outliers_ts <- sum(abs(HouseData$price - avg) > 3 * std)
upper_ts <- avg + 3 * std
lower_ts <- avg - 3 * std
# Keep the limits in their own object so the count above is not overwritten
limits_ts <- list(up = upper_ts, down = lower_ts)
plot(HouseData$price,
     main = "Three Sigma",
     ylab = "Value",
     col = 'blue', ylim = c(-1e+6, 5e+06))
abline(h = mean(HouseData$price), lty = "dashed", lwd = 1)
abline(h = upper_ts, lty = "dotted", lwd = 2)
abline(h = lower_ts, lty = "dotted", lwd = 2)
med <- median(HouseData$price)
sig <- mad(HouseData$price)
data <- HouseData$price
# Count the prices that lie more than 3 MAD scale estimates from the median
outliers_h <- sum(abs(data - med) > 3 * sig)
upper_h <- med + 3 * sig
lower_h <- med - 3 * sig
outliers_hampel <- list(up = upper_h, down = lower_h)
plot(HouseData$price,
     main = "Hampel Identifier",
     ylab = "Value",
     col = 'blue', ylim = c(-1e+6, 5e+06))
abline(h = median(HouseData$price), lty = "dashed", lwd = 1)
abline(h = upper_h, lty = "dotted", lwd = 2)
abline(h = lower_h, lty = "dotted", lwd = 2)
out <- boxplot.stats(HouseData$price)$out
boxplot(HouseData$price,
ylab = "",
main = "House Price Boxplot"
)
mtext(paste("Outliers: ", paste(length(out), collapse = ", ")))
Using the Three Sigma edit rule, the lower limit for non-outlier data is -561293.4 (minNom 75000) and the upper limit is 1641470 (maxNom 1640000), so 406 observations are flagged as outliers. Using the Hampel identifier, the lower limit is -217170.0 (minNom 75000) and the upper limit is 1117170 (maxNom 1115500), so 1166 observations are flagged. Lastly, using the Boxplot rule, the lower limit is -162625.0 (minNom 75000) and the upper limit is 1129575 (maxNom 1127500), so 1146 observations are flagged. All three lower limits are negative, so no price is flagged as a low outlier by any rule; among them, the lower limit of the Boxplot rule seems the most reasonable because it lies closest to the minNom value. Among the upper limits, the one from the Hampel identifier seems the most reasonable: it is the smallest of the three and stays closest to the centre of the distribution.
In total, the Three Sigma rule, the Hampel identifier, and the Boxplot rule flag 406, 1166, and 1146 data points as outliers, respectively. Based on the results of the FindOutliers() function and the plots above, the Hampel identifier gives the most reliable result, identifying 1166 data points as outliers.
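The reason the three rules disagree can be seen on a small example: the mean and standard deviation are themselves inflated by extreme values, while the median, MAD, and quartiles are not. The sketch below computes only the upper limits on a hypothetical vector `x` with one extreme value, using the same default thresholds (3, 3, and 1.5) as the functions above.

```r
# Hypothetical values with a single extreme observation (100)
x <- c(10, 11, 12, 13, 14, 15, 16, 100)

# Upper limits of the three rules
three_sigma_up <- mean(x) + 3 * sd(x)
hampel_up <- median(x) + 3 * mad(x)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
boxplot_up <- q[2] + 1.5 * (q[2] - q[1])

limits <- c(ThreeSigma = three_sigma_up, Hampel = hampel_up, BoxplotRule = boxplot_up)

# Number of values above each upper limit
flagged <- sapply(limits, function(up) sum(x > up))

print(round(limits, 2))
print(flagged)
```

In this toy case the extreme value pulls the three-sigma limit above itself, so that rule flags nothing, while the Hampel identifier and the Boxplot rule both flag it. This is consistent with the much smaller outlier count reported by the Three Sigma rule on the price variable.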
outlierIndex_table <- which(HouseData$price > 1117170 | HouseData$price < -162625.0)
slice_data <- slice(HouseData, outlierIndex_table)
head(slice_data)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## 2 2524049179 20140826T000000 2000000 3 2.75 3050 44867
## 3 822039084 20150311T000000 1350000 3 2.50 2753 65005
## 4 1802000060 20140612T000000 1325000 5 2.25 3200 20158
## 5 4389200955 20150302T000000 1450000 4 2.75 2750 17789
## 6 7855801670 20150401T000000 2250000 4 3.25 5180 19850
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1.0 0 0 3 11 3890 1530 2001
## 2 1.0 0 4 3 9 2330 720 1968
## 3 1.0 1 2 5 9 2165 588 1953
## 4 1.0 0 0 3 8 1600 1600 1965
## 5 1.5 0 0 3 8 1980 770 1914
## 6 2.0 0 3 3 12 3540 1640 2006
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98053 47.6561 -122.005 4760 101930
## 2 0 98040 47.5316 -122.233 4110 20336
## 3 0 98070 47.4041 -122.451 2680 72513
## 4 0 98004 47.6303 -122.215 3390 20158
## 5 1992 98004 47.6141 -122.212 3060 11275
## 6 0 98006 47.5620 -122.162 3160 9750
Outliers were found in the price variable, but our team opted not to remove them. Prices outside the usual range can arise from a variety of factors, such as rising material prices, rising labour costs, inflation, and a rising cost of living, along with several other factors that determine house prices.
canvas <- layout(matrix(c(1,2,3,4), nrow=2, byrow=TRUE))
plot(HouseData$sqft_living, HouseData$bathrooms, main="sqft_living VS bathrooms", xlab="Living Area (sqft)", ylab="Bathrooms")
plot(HouseData$sqft_living, HouseData$sqft_living15, main="sqft_living VS sqft_living15", xlab="Living Area (sqft)", ylab="sqft_living15")
plot(HouseData$sqft_above, HouseData$sqft_living, main="sqft_above VS sqft_living", xlab="Area Above Ground (sqft)", ylab="Living Area (sqft)")
plot(HouseData$lat, HouseData$bedrooms, main="lat VS bedrooms", xlab="Lat", ylab="Bedrooms")
From the output above :
- 'sqft_living' and 'bathrooms' appear to be positively related: the higher the value of 'sqft_living', the higher the number of 'bathrooms' tends to be, and vice versa.
- 'sqft_living' and 'sqft_living15' are two different variables and are positively related, though not perfectly: the higher the value of 'sqft_living', the higher the value of 'sqft_living15' tends to be, and vice versa.
- 'sqft_living' and 'sqft_above' are positively related: the higher the value of 'sqft_above', the higher the value of 'sqft_living', and vice versa.
- 'lat' and 'bedrooms' do not appear to be related, because the plot shows neither an ascending nor a descending trend.
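Relations read off scatter plots can be backed by correlation coefficients. The sketch below computes a Pearson correlation matrix with cor() on the six rows printed by head(HouseData); six rows are far too few for a conclusion, so this only illustrates the call, and the data frame name `toy` is hypothetical.

```r
# Three columns copied from the six rows shown by head(HouseData)
toy <- data.frame(
  sqft_living   = c(1180, 2570, 770, 1960, 1680, 5420),
  sqft_living15 = c(1340, 1690, 2720, 1360, 1800, 4760),
  bathrooms     = c(1, 2.25, 1, 3, 2, 4.5)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# A variable is perfectly correlated only with itself (the diagonal)
print(diag(cor_matrix))
```

On the full data set the same call could be applied to the corresponding columns of HouseData, and the corrplot package loaded earlier could be used to draw the resulting matrix.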
PATTERN DISCOVERY
HousePattern <- HouseData
HousePattern$view = as.factor(HousePattern$view)
ggplot(HousePattern, aes(x=view, y=price)) + geom_boxplot() + ggtitle("House Price vs. Rating View")
According to the output above, houses with views rated at 3 and 4 have higher house prices than houses with views at levels 0, 1, or 2.
HousePattern$bedrooms = as.factor(HousePattern$bedrooms)
ggplot(HousePattern, aes(x=bedrooms, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Bedrooms")
Looking at the median (2nd quartile) of each group, houses with 9 bedrooms have the highest price. However, looking at the overall distribution of the data, houses with 8 bedrooms reach the highest prices.
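The two readings of a boxplot, the median of each group and its highest value, can also be computed directly. The following is a minimal sketch with tapply() on a hypothetical data frame `toy`; the numbers are invented for illustration and are not taken from the data set.

```r
# Hypothetical prices for houses with 2, 3, and 4 bedrooms
toy <- data.frame(
  bedrooms = c(2, 2, 3, 3, 3, 4, 4),
  price    = c(180000, 220000, 300000, 450000, 510000, 600000, 1200000)
)

# Median and maximum price within each bedroom count
median_by_bedrooms <- tapply(toy$price, toy$bedrooms, median)
max_by_bedrooms <- tapply(toy$price, toy$bedrooms, max)

print(median_by_bedrooms)
print(max_by_bedrooms)

# Bedroom count with the highest median price
top_median_group <- names(which.max(median_by_bedrooms))
print(top_median_group)
```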
HousePattern$bathrooms = as.factor(HousePattern$bathrooms)
ggplot(HousePattern, aes(x=bathrooms, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Bathrooms")
The plot shows that houses priced below 1,000,000 dollars mostly have fewer than 3 bathrooms; only a few of them have more than 3 bathrooms, mainly 3.25 or 3.5.
HousePattern$floors = as.factor(HousePattern$floors)
ggplot(HousePattern, aes(x=floors, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Floors")
The result above shows that houses with 2.5 floors have the highest prices compared with houses with other numbers of floors.
HousePattern$waterfront = as.factor(HousePattern$waterfront)
ggplot(HousePattern, aes(x=waterfront, y=price)) + geom_boxplot() + ggtitle("House Price vs. Waterfront")
The result above shows that houses with a waterfront command a higher price than houses without a waterfront.
HousePattern$condition = as.factor(HousePattern$condition)
ggplot(HousePattern, aes(x=condition, y=price)) + geom_boxplot() + ggtitle("House Price vs. House Condition Rating")
Houses with a condition rated 3, 4, or 5 are more expensive than houses with a condition rated 1 or 2. Houses with a condition rated 5 are the most expensive.
HousePattern$grade = as.factor(HousePattern$grade)
ggplot(HousePattern, aes(x=grade, y=price)) + geom_boxplot() + ggtitle("House Price vs. House Grade")
The plot above shows a positive correlation between the house price and the house grade variables. It can be concluded that the higher the grade of a house, the higher the price of the house.
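Since grade is an ordered rating rather than a continuous measurement, a rank-based (Spearman) correlation is one possible way to put a number on the trend seen in the boxplot. The sketch below uses hypothetical grade and price values; `toy` and `rho` are illustrative names.

```r
# Hypothetical grades and prices for six houses
toy <- data.frame(
  grade = c(6, 7, 7, 8, 9, 11),
  price = c(180000, 221900, 538000, 510000, 700000, 1225000)
)

# Spearman correlation works on ranks, which suits an ordinal rating
rho <- cor(toy$grade, toy$price, method = "spearman")
print(rho)
```

A value of `rho` close to 1 would support the statement that higher grades go together with higher prices.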
HousePattern$isRenovated <- as.logical(HousePattern$yr_renovated)
ggplot(HousePattern, aes(x=isRenovated, y=price)) + geom_boxplot() + ggtitle("House Price vs. is Renovated")
Renovated houses are likely to have a higher price compared with non-renovated houses.
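The renovation flag relies on as.logical() turning 0 into FALSE and any other number into TRUE. The sketch below, on a hypothetical vector, compares this with the more explicit test `yr_renovated > 0`, which gives the same result here and may be easier to read.

```r
# Hypothetical renovation years, including a missing value
yr_renovated <- c(0, 1991, 0, 2005, NA)

flag_logical  <- as.logical(yr_renovated)  # 0 -> FALSE, non-zero -> TRUE
flag_explicit <- yr_renovated > 0          # same idea, stated explicitly

print(flag_logical)
print(identical(flag_logical, flag_explicit))

# Number of renovated and non-renovated houses, keeping missing values visible
print(table(flag_explicit, useNA = "ifany"))
```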
mapview(HousePattern, xcol = "long", ycol = "lat", zcol = "price", crs = 4269, grid = FALSE)